{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bike-sharing forecasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial we're going to forecast the number of bikes in 5 bike stations from the city of Toulouse. We'll do so by building a simple model step by step. The dataset contains 182,470 observations. Let's first take a peak at the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:56:24.292416Z",
     "iopub.status.busy": "2020-11-25T20:56:24.291940Z",
     "iopub.status.idle": "2020-11-25T20:56:25.384948Z",
     "shell.execute_reply": "2020-11-25T20:56:25.384466Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://maxhalford.github.io/files/datasets/toulouse_bikes.zip (1.12 MB)\n",
      "Uncompressing into /home/runner/river_data/Bikes\n",
      "{'clouds': 75,\n",
      " 'description': 'light rain',\n",
      " 'humidity': 81,\n",
      " 'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),\n",
      " 'pressure': 1017.0,\n",
      " 'station': 'metro-canal-du-midi',\n",
      " 'temperature': 6.54,\n",
      " 'wind': 9.3}\n",
      "Number of available bikes: 1\n"
     ]
    }
   ],
   "source": [
    "from pprint import pprint\n",
    "from river import datasets\n",
    "\n",
    "X_y = datasets.Bikes()\n",
    "\n",
    "for x, y in X_y:\n",
    "    pprint(x)\n",
    "    print(f'Number of available bikes: {y}')\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by using a simple linear regression on the numeric features. We can select the numeric features and discard the rest of the features using a `Select`. Linear regression is very likely to go haywire if we don't scale the data, so we'll use a `StandardScaler` to do just that. We'll evaluate the model by measuring the mean absolute error. Finally we'll print the score every 20,000 observations. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:56:25.416921Z",
     "iopub.status.busy": "2020-11-25T20:56:25.389806Z",
     "iopub.status.idle": "2020-11-25T20:56:41.567216Z",
     "shell.execute_reply": "2020-11-25T20:56:41.566740Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[20,000] MAE: 4.912727\n",
      "[40,000] MAE: 5.333554\n",
      "[60,000] MAE: 5.330948\n",
      "[80,000] MAE: 5.392313\n",
      "[100,000] MAE: 5.423059\n",
      "[120,000] MAE: 5.541223\n",
      "[140,000] MAE: 5.613023\n",
      "[160,000] MAE: 5.622428\n",
      "[180,000] MAE: 5.567824\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "MAE: 5.563893"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import compose\n",
    "from river import linear_model\n",
    "from river import metrics\n",
    "from river import evaluate\n",
    "from river import preprocessing\n",
    "from river import optim\n",
    "\n",
    "X_y = datasets.Bikes()\n",
    "\n",
    "model = compose.Select('clouds', 'humidity', 'pressure', 'temperature', 'wind')\n",
    "model |= preprocessing.StandardScaler()\n",
    "model |= linear_model.LinearRegression(optimizer=optim.SGD(0.001))\n",
    "\n",
    "metric = metrics.MAE()\n",
    "\n",
    "evaluate.progressive_val_score(X_y, model, metric, print_every=20_000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model doesn't seem to be doing that well, but then again we didn't provide a lot of features. Generally, a good idea for this kind of problem is to look at an average of the previous values. For example, for each station we can look at the average number of bikes per hour. To do so we first have to extract the hour from the  `moment` field. We can then use a `TargetAgg` to aggregate the values of the target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:56:41.589751Z",
     "iopub.status.busy": "2020-11-25T20:56:41.572682Z",
     "iopub.status.idle": "2020-11-25T20:57:06.473133Z",
     "shell.execute_reply": "2020-11-25T20:57:06.472640Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[20,000] MAE: 3.721246\n",
      "[40,000] MAE: 3.829972\n",
      "[60,000] MAE: 3.845068\n",
      "[80,000] MAE: 3.910259\n",
      "[100,000] MAE: 3.888652\n",
      "[120,000] MAE: 3.923727\n",
      "[140,000] MAE: 3.980953\n",
      "[160,000] MAE: 3.950034\n",
      "[180,000] MAE: 3.934545\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "MAE: 3.933498"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import feature_extraction\n",
    "from river import stats\n",
    "\n",
    "X_y = iter(datasets.Bikes())\n",
    "\n",
    "def get_hour(x):\n",
    "    x['hour'] = x['moment'].hour\n",
    "    return x\n",
    "\n",
    "model = compose.Select('clouds', 'humidity', 'pressure', 'temperature', 'wind')\n",
    "model += (\n",
    "    get_hour |\n",
    "    feature_extraction.TargetAgg(by=['station', 'hour'], how=stats.Mean())\n",
    ")\n",
    "model |= preprocessing.StandardScaler()\n",
    "model |= linear_model.LinearRegression(optimizer=optim.SGD(0.001))\n",
    "\n",
    "metric = metrics.MAE()\n",
    "\n",
    "evaluate.progressive_val_score(X_y, model, metric, print_every=20_000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By adding a single feature, we've managed to significantly reduce the mean absolute error. At this point you might think that the model is getting slightly complex, and is difficult to understand and test. Pipelines have the advantage of being terse, but they aren't always to debug. Thankfully `river` has some ways to relieve the pain.\n",
    "\n",
    "The first thing we can do it to draw the pipeline, to get an idea of how the data flows through it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:57:06.477405Z",
     "iopub.status.busy": "2020-11-25T20:57:06.476928Z",
     "iopub.status.idle": "2020-11-25T20:57:06.588576Z",
     "shell.execute_reply": "2020-11-25T20:57:06.588163Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
       " -->\n",
       "<!-- Title: %3 Pages: 1 -->\n",
       "<svg width=\"682pt\" height=\"404pt\"\n",
       " viewBox=\"0.00 0.00 682.22 404.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 400)\">\n",
       "<title>%3</title>\n",
       "<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-400 678.2209,-400 678.2209,4 -4,4\"/>\n",
       "<!-- x -->\n",
       "<g id=\"node1\" class=\"node\">\n",
       "<title>x</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"367.9827\" cy=\"-378\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"367.9827\" y=\"-374.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">x</text>\n",
       "</g>\n",
       "<!-- [&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;] -->\n",
       "<g id=\"node2\" class=\"node\">\n",
       "<title>[&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;]</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"194.9827\" cy=\"-234\" rx=\"194.9654\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"194.9827\" y=\"-230.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">[&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;]</text>\n",
       "</g>\n",
       "<!-- x&#45;&gt;[&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;] -->\n",
       "<g id=\"edge1\" class=\"edge\">\n",
       "<title>x&#45;&gt;[&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;]</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M351.0797,-363.9305C321.6754,-339.4552 260.7728,-288.7617 224.3409,-258.4369\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"226.444,-255.6336 216.519,-251.9261 221.9657,-261.0137 226.444,-255.6336\"/>\n",
       "</g>\n",
       "<!-- get_hour -->\n",
       "<g id=\"node3\" class=\"node\">\n",
       "<title>get_hour</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"471.9827\" cy=\"-306\" rx=\"42.4939\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"471.9827\" y=\"-302.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">get_hour</text>\n",
       "</g>\n",
       "<!-- x&#45;&gt;get_hour -->\n",
       "<g id=\"edge3\" class=\"edge\">\n",
       "<title>x&#45;&gt;get_hour</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M387.0579,-364.7941C402.2769,-354.2578 423.888,-339.2964 441.5004,-327.1031\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"443.55,-329.9411 449.7797,-321.3713 439.5655,-324.1857 443.55,-329.9411\"/>\n",
       "</g>\n",
       "<!-- StandardScaler -->\n",
       "<g id=\"node5\" class=\"node\">\n",
       "<title>StandardScaler</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"367.9827\" cy=\"-162\" rx=\"63.8893\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"367.9827\" y=\"-158.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">StandardScaler</text>\n",
       "</g>\n",
       "<!-- [&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;]&#45;&gt;StandardScaler -->\n",
       "<g id=\"edge4\" class=\"edge\">\n",
       "<title>[&#39;clouds&#39;, &#39;humidity&#39;, &#39;pressure&#39;, &#39;temperature&#39;, &#39;wind&#39;]&#45;&gt;StandardScaler</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M237.3033,-216.3868C263.1397,-205.6341 296.1641,-191.8898 322.5498,-180.9085\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"324.0574,-184.0721 331.9449,-176.9984 321.3677,-177.6095 324.0574,-184.0721\"/>\n",
       "</g>\n",
       "<!-- target_mean_by_station_and_hour -->\n",
       "<g id=\"node4\" class=\"node\">\n",
       "<title>target_mean_by_station_and_hour</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"540.9827\" cy=\"-234\" rx=\"133.4768\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"540.9827\" y=\"-230.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">target_mean_by_station_and_hour</text>\n",
       "</g>\n",
       "<!-- get_hour&#45;&gt;target_mean_by_station_and_hour -->\n",
       "<g id=\"edge2\" class=\"edge\">\n",
       "<title>get_hour&#45;&gt;target_mean_by_station_and_hour</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M487.9861,-289.3008C496.5517,-280.3627 507.277,-269.1711 516.8343,-259.1983\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"519.4679,-261.5087 523.86,-251.8671 514.414,-256.6654 519.4679,-261.5087\"/>\n",
       "</g>\n",
       "<!-- target_mean_by_station_and_hour&#45;&gt;StandardScaler -->\n",
       "<g id=\"edge5\" class=\"edge\">\n",
       "<title>target_mean_by_station_and_hour&#45;&gt;StandardScaler</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M499.5449,-216.7542C473.5485,-205.9349 440.0184,-191.9802 413.3147,-180.8665\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"414.3892,-177.5227 403.812,-176.9116 411.6995,-183.9854 414.3892,-177.5227\"/>\n",
       "</g>\n",
       "<!-- LinearRegression -->\n",
       "<g id=\"node6\" class=\"node\">\n",
       "<title>LinearRegression</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"367.9827\" cy=\"-90\" rx=\"72.5877\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"367.9827\" y=\"-86.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">LinearRegression</text>\n",
       "</g>\n",
       "<!-- StandardScaler&#45;&gt;LinearRegression -->\n",
       "<g id=\"edge6\" class=\"edge\">\n",
       "<title>StandardScaler&#45;&gt;LinearRegression</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M367.9827,-143.8314C367.9827,-136.131 367.9827,-126.9743 367.9827,-118.4166\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"371.4828,-118.4132 367.9827,-108.4133 364.4828,-118.4133 371.4828,-118.4132\"/>\n",
       "</g>\n",
       "<!-- y -->\n",
       "<g id=\"node7\" class=\"node\">\n",
       "<title>y</title>\n",
       "<ellipse fill=\"none\" stroke=\"#000000\" cx=\"367.9827\" cy=\"-18\" rx=\"27\" ry=\"18\"/>\n",
       "<text text-anchor=\"middle\" x=\"367.9827\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">y</text>\n",
       "</g>\n",
       "<!-- LinearRegression&#45;&gt;y -->\n",
       "<g id=\"edge7\" class=\"edge\">\n",
       "<title>LinearRegression&#45;&gt;y</title>\n",
       "<path fill=\"none\" stroke=\"#000000\" d=\"M367.9827,-71.8314C367.9827,-64.131 367.9827,-54.9743 367.9827,-46.4166\"/>\n",
       "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"371.4828,-46.4132 367.9827,-36.4133 364.4828,-46.4133 371.4828,-46.4132\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.dot.Digraph at 0x7fb1c02939a0>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use the `debug_one` method to see what happens to one particular instance. Let's train the model on the first 10,000 observations and then call `debug_one` on the next one. To do this, we will turn the `Bike` object into a Python generator with `iter()` function. The Pythonic way to read the first 10,000 elements of a generator is to use `itertools.islice`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:57:06.638894Z",
     "iopub.status.busy": "2020-11-25T20:57:06.594361Z",
     "iopub.status.idle": "2020-11-25T20:57:07.676817Z",
     "shell.execute_reply": "2020-11-25T20:57:07.677227Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0. Input\n",
      "--------\n",
      "clouds: 0 (int)\n",
      "description: clear sky (str)\n",
      "humidity: 52 (int)\n",
      "moment: 2016-04-10 19:03:27 (datetime)\n",
      "pressure: 1,001.00000 (float)\n",
      "station: place-esquirol (str)\n",
      "temperature: 19.00000 (float)\n",
      "wind: 7.70000 (float)\n",
      "\n",
      "1. Transformer union\n",
      "--------------------\n",
      "    1.0 Select\n",
      "    ----------\n",
      "    clouds: 0 (int)\n",
      "    humidity: 52 (int)\n",
      "    pressure: 1,001.00000 (float)\n",
      "    temperature: 19.00000 (float)\n",
      "    wind: 7.70000 (float)\n",
      "\n",
      "    1.1 get_hour | target_mean_by_station_and_hour\n",
      "    ----------------------------------------------\n",
      "    target_mean_by_station_and_hour: 7.97175 (float)\n",
      "\n",
      "clouds: 0 (int)\n",
      "humidity: 52 (int)\n",
      "pressure: 1,001.00000 (float)\n",
      "target_mean_by_station_and_hour: 7.97175 (float)\n",
      "temperature: 19.00000 (float)\n",
      "wind: 7.70000 (float)\n",
      "\n",
      "2. StandardScaler\n",
      "-----------------\n",
      "clouds: -1.36138 (float)\n",
      "humidity: -1.73083 (float)\n",
      "pressure: -1.26076 (float)\n",
      "target_mean_by_station_and_hour: 0.05496 (float)\n",
      "temperature: 1.76232 (float)\n",
      "wind: 1.45841 (float)\n",
      "\n",
      "3. LinearRegression\n",
      "-------------------\n",
      "Name                              Value      Weight     Contribution  \n",
      "                      Intercept    1.00000    6.58252        6.58252  \n",
      "                    temperature    1.76232    2.47030        4.35345  \n",
      "                         clouds   -1.36138   -1.92255        2.61732  \n",
      "target_mean_by_station_and_hour    0.05496    0.54167        0.02977  \n",
      "                           wind    1.45841   -0.77720       -1.13348  \n",
      "                       humidity   -1.73083    1.44921       -2.50833  \n",
      "                       pressure   -1.26076    3.78529       -4.77234  \n",
      "\n",
      "Prediction: 5.16889\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "\n",
    "X_y = iter(datasets.Bikes())\n",
    "\n",
    "model = compose.Select('clouds', 'humidity', 'pressure', 'temperature', 'wind')\n",
    "model += (\n",
    "    get_hour |\n",
    "    feature_extraction.TargetAgg(by=['station', 'hour'], how=stats.Mean())\n",
    ")\n",
    "model |= preprocessing.StandardScaler()\n",
    "model |= linear_model.LinearRegression()\n",
    "\n",
    "for x, y in itertools.islice(X_y, 10000):\n",
    "    y_pred = model.predict_one(x)\n",
    "    model.learn_one(x, y)\n",
    "    \n",
    "x, y = next(X_y)\n",
    "print(model.debug_one(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `debug_one` method shows what happens to an input set of features, step by step.\n",
    "\n",
    "And now comes the catch. Up until now we've been using the `progressive_val_score` method from the `evaluate` module. What this does it that it sequentially predicts the output of an observation and updates the model immediately afterwards. This way of doing is often used for evaluating online learning models, but in some cases it is the wrong approach.\n",
    "\n",
    "The following paragraph is extremely important. When evaluating a machine learning model, the goal is to simulate production conditions in order to get a trust-worthy assessment of the performance of the model. In our case, we typically want to forecast the number of bikes available in a station, say, 30 minutes ahead. Then, once the 30 minutes have passed, the true number of available bikes will be available and we will be able to update the model using the features available 30 minutes ago. If you think about, this is exactly how a real-time machine learning system should work. The problem is that this isn't what the `progressive_val_score` method is emulating, indeed it is simply asking the model to predict the next observation, which is only a few minutes ahead, and then updates the model immediately. We can prove that this is flawed by adding a feature that measures a running average of the very recent values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:57:07.683303Z",
     "iopub.status.busy": "2020-11-25T20:57:07.682821Z",
     "iopub.status.idle": "2020-11-25T20:57:37.282567Z",
     "shell.execute_reply": "2020-11-25T20:57:37.282106Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[20,000] MAE: 20.159286\n",
      "[40,000] MAE: 10.458898\n",
      "[60,000] MAE: 7.2759\n",
      "[80,000] MAE: 5.715397\n",
      "[100,000] MAE: 4.775094\n",
      "[120,000] MAE: 4.138421\n",
      "[140,000] MAE: 3.682591\n",
      "[160,000] MAE: 3.35015\n",
      "[180,000] MAE: 3.091398\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "MAE: 3.06414"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_y = datasets.Bikes()\n",
    "\n",
    "model = compose.Select('clouds', 'humidity', 'pressure', 'temperature', 'wind')\n",
    "model += (\n",
    "    get_hour |\n",
    "    feature_extraction.TargetAgg(by=['station', 'hour'], how=stats.Mean()) + \n",
    "    feature_extraction.TargetAgg(by='station', how=stats.EWMean(0.5))\n",
    ")\n",
    "model |= preprocessing.StandardScaler()\n",
    "model |= linear_model.LinearRegression()\n",
    "\n",
    "metric = metrics.MAE()\n",
    "\n",
    "evaluate.progressive_val_score(X_y, model, metric, print_every=20_000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The score we got is too good to be true. This is simply because the problem is too easy. What we really want is to evaluate the model by forecasting 30 minutes ahead and only updating the model once the true values are available. This can be done using the `moment` and `delay` parameters in the  `progressive_val_score` method. The idea is that each observation of the stream of the data is shown twice to the model: once for making a prediction, and once for updating the model when the true value is revealed. The `moment` parameter determines which variable should be used as a timestamp, while the `delay` parameter controls the duration to wait before revealing the true values to the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:57:37.293939Z",
     "iopub.status.busy": "2020-11-25T20:57:37.287952Z",
     "iopub.status.idle": "2020-11-25T20:58:06.811866Z",
     "shell.execute_reply": "2020-11-25T20:58:06.812301Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[20,000] MAE: 2.24812\n",
      "[40,000] MAE: 2.240287\n",
      "[60,000] MAE: 2.270287\n",
      "[80,000] MAE: 2.28649\n",
      "[100,000] MAE: 2.294264\n",
      "[120,000] MAE: 2.275891\n",
      "[140,000] MAE: 2.261411\n",
      "[160,000] MAE: 2.285978\n",
      "[180,000] MAE: 2.289353\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "MAE: 2.29304"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import datetime as dt\n",
    "\n",
    "model = compose.Select('clouds', 'humidity', 'pressure', 'temperature', 'wind')\n",
    "model += (\n",
    "    get_hour |\n",
    "    feature_extraction.TargetAgg(by=['station', 'hour'], how=stats.Mean()) + \n",
    "    feature_extraction.TargetAgg(by='station', how=stats.EWMean(0.5))\n",
    ")\n",
    "model |= preprocessing.StandardScaler()\n",
    "model |= linear_model.LinearRegression()\n",
    "\n",
    "evaluate.progressive_val_score(\n",
    "    dataset=datasets.Bikes(),\n",
    "    model=model,\n",
    "    metric=metrics.MAE(),\n",
    "    moment='moment',\n",
    "    delay=dt.timedelta(minutes=30),\n",
    "    print_every=20_000\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The score we now have is much more realistic, as it is comparable with the [related data science competition](https://maxhalford.github.io/blog/openbikes-challenge/). Moreover, we can see that the model gets better with time, which feels better than the previous situations. The point is that `progressive_val_score` method can be used to simulate a production scenario, and is thus extremely valuable.\n",
    "\n",
    "Now that we have a working pipeline in place, we can attempt to make it more accurate. As a simple example, we'll using a `EWARegressor` from the `expert` module to combine 3 linear regression model trained with different optimizers. The `EWARegressor` will run the 3 models in parallel and assign weights to each model based on their individual performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-11-25T20:58:06.824414Z",
     "iopub.status.busy": "2020-11-25T20:58:06.818480Z",
     "iopub.status.idle": "2020-11-25T20:58:46.438694Z",
     "shell.execute_reply": "2020-11-25T20:58:46.437892Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[20,000] MAE: 2.253263\n",
      "[40,000] MAE: 2.242859\n",
      "[60,000] MAE: 2.272001\n",
      "[80,000] MAE: 2.287776\n",
      "[100,000] MAE: 2.295292\n",
      "[120,000] MAE: 2.276748\n",
      "[140,000] MAE: 2.262146\n",
      "[160,000] MAE: 2.286621\n",
      "[180,000] MAE: 2.289925\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "MAE: 2.293604"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import expert\n",
    "from river import optim\n",
    "\n",
    "model = compose.Select('clouds', 'humidity', 'pressure', 'temperature', 'wind')\n",
    "model += (\n",
    "    get_hour |\n",
    "    feature_extraction.TargetAgg(by=['station', 'hour'], how=stats.Mean())\n",
    ")\n",
    "model += feature_extraction.TargetAgg(by='station', how=stats.EWMean(0.5))\n",
    "model |= preprocessing.StandardScaler()\n",
    "model |= expert.EWARegressor([\n",
    "    linear_model.LinearRegression(optim.SGD()),\n",
    "    linear_model.LinearRegression(optim.RMSProp()),\n",
    "    linear_model.LinearRegression(optim.Adam())\n",
    "])\n",
    "\n",
    "evaluate.progressive_val_score(\n",
    "    dataset=datasets.Bikes(),\n",
    "    model=model,\n",
    "    metric=metrics.MAE(),\n",
    "    moment='moment',\n",
    "    delay=dt.timedelta(minutes=30),\n",
    "    print_every=20_000\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
